Abstract

We consider block codes whose rate converges to the channel capacity with increasing block length at a certain 
speed and examine the best possible decay of the probability of error. We prove that a moderate deviation principle 
holds for all convergence rates between the large deviation and the central limit theorem regimes. 



I. Introduction 

In block channel coding, there is a fundamental interplay between the rate, i.e., the amount of information 
transmitted per channel use, the block length, i.e., the total number of channel uses, and the probability of error. 
In this paper, we analyze the interplay between these three parameters for the best block codes. Specifically, we 
address the following question: for a given discrete memoryless channel, what is the fastest rate at which the error 
probability can decay to zero if the rate increases to the channel capacity with increasing block length? We begin 
by reviewing the literature on the interaction between these three basic parameters. 

Shannon [1] formulated the channel coding problem and characterized the largest fixed rate such that the error
probability could be driven to zero with increasing block length. Later, Strassen [2] considered the following
more-refined characterization. Given a block length $n$ and an $\epsilon \in (0, 1)$, what is the largest possible rate of a code with
maximal error probability less than or equal to $\epsilon$? If $\epsilon \in (0, 1/2)$, then Strassen showed that this rate is equal to

\[ C + \sqrt{\frac{\sigma^2(W)}{n}}\, \Phi^{-1}(\epsilon) + O\!\left(\frac{\log n}{n}\right), \tag{1} \]

where $C$ denotes the channel capacity, $\Phi$ denotes the standard Gaussian distribution, and $\sigma^2(W)$ is a statistic of
the channel defined later. More recently, Polyanskiy et al. [3], [4] provided an improved characterization of the
$O((\log n)/n)$ term and extended the result to Gaussian channels. Following the convention of [4], we call $\sigma^2(W)$
the dispersion of the channel. We note that although Strassen's result is classical, there is a renewed interest in his
setup; see, e.g., [5]-[15], and references therein.

Another approach to the characterization of the interplay between rate, block length, and the probability of error is
the so-called error exponents, which can be formulated as follows. Given a discrete memoryless channel and a fixed
rate below the capacity¹, what is the best exponential rate of decay of the error probability with the block length?
Classical results characterized the best exponent at rates close to capacity for a broad class of channels [16]-[22].

Our result lies between Strassen's result and error exponents in the sense that we require the rate to approach 
capacity and the error probability to simultaneously tend to zero. This formulation is arguably more relevant to 
practical code design than either error exponents or Strassen's result. The goal in channel coding is, after all, to 
attain a rate that is close to capacity and an error probability that is close to zero. Although error exponents allow 
for vanishing error probabilities, the rate is bounded away from capacity. In Strassen's result, on the other hand, 
the rate approaches capacity, but the error probability is bounded away from zero. 

To place this formulation in context, it is helpful to consider the more-elementary setup of a sum of independent
and identically distributed (i.i.d.) random variables. If we scale the sum with $1/n$, it converges to the mean by
the law of large numbers. Cramér's Theorem (e.g., [23, Theorem 2.2.3]) characterizes the probability that the
unnormalized sum makes an order-$n$ deviation from its mean. This probability decays exponentially in $n$, and
Cramér's characterization of the exponent is now termed a large deviations result. The central limit theorem, on the
other hand, characterizes the probability that the unnormalized sum makes an order-$\sqrt{n}$ deviation. As $n$ tends to
infinity, this probability converges to a positive constant that is governed by the Normal distribution. Likewise, one
can characterize the probability that the unnormalized sum makes a deviation whose size lies between these two
extremes [23, Theorem 3.7.1]. This is now called a moderate deviations result. Error exponents in channel coding
are analogous to large deviations for i.i.d. sums, in that they both characterize exponentially small probabilities
using similar techniques. Strassen's result is akin to the central limit theorem; indeed, it is sometimes called the
normal approximation. The result in this paper is analogous to moderate deviations.

The material in this paper was presented in part at the 2010 IEEE International Symposium on Information Theory (ISIT), Austin, TX.
The authors are with the School of Electrical and Computer Engineering, Cornell University, Ithaca, NY 14853, USA. E-mail:
ya68@cornell.edu, wagner@ece.cornell.edu.

1 In the literature, there is a considerable amount of work on the error exponents for rates above the capacity, not only for discrete
memoryless channels, but also for various other problems. However, we shall only be concerned with rates below the capacity here.
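The three regimes for i.i.d. sums can be probed numerically. The following is a minimal sketch, assuming fair coin flips (per-flip variance 1/4) and the illustrative choice $\epsilon_n = n^{-1/3}$; the function name and parameters are hypothetical and not part of the paper. It evaluates the exact binomial tail and normalizes its logarithm by $n\epsilon_n^2$, which a moderate deviations result predicts should approach $-1/(2 \cdot 1/4) = -2$, although convergence is slow because of sub-exponential factors.

```python
import math

def moderate_deviation_exponent(n, a=1/3):
    """Return log P(S_n - n/2 >= n * eps_n) / (n * eps_n**2) for fair coin flips.

    S_n is the number of heads in n flips and eps_n = n**(-a) with 0 < a < 1/2,
    so the deviation n * eps_n lies strictly between order sqrt(n) and order n.
    """
    eps = n ** (-a)
    k_min = math.ceil(n / 2 + n * eps)
    # Exact tail count with integer arithmetic, then converted to a log-probability.
    tail = sum(math.comb(n, k) for k in range(k_min, n + 1))
    log_prob = math.log(tail) - n * math.log(2)
    return log_prob / (n * eps ** 2)

# The printed values should drift slowly toward the limiting value -2.
for n in (1000, 4000):
    print(n, moderate_deviation_exponent(n))
```

Choosing $a$ closer to $1/2$ moves the example toward the central limit regime, and $a$ closer to $0$ toward the large deviations regime.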

Although moderate deviations have been a fixture of probability theory for some time (e.g., [24]-[26], [27,
Sec. XVI.7], [28, Chapter 8] and references therein), they appeared in the information theory literature only recently.
The present result was first proven for positive discrete memoryless channels [29]. Prior to that, apparently the
only moderate deviations result in information theory was the work of He et al. [30]-[33] on the Slepian-Wolf
problem. Polyanskiy and Verdú [34] improved the result in [29] by relaxing the positivity assumption and extending
it to Gaussian channels, among other contributions. More recently, moderate deviations in lossy source coding and
hypothesis testing problems have been investigated by Tan [35] and Sason [36], respectively.

The result provided here improves upon the conference version [29] by relaxing the positivity assumption and
simplifying the argument. The proof is different from that of Polyanskiy and Verdú, who rely on methods from
[4] and powerful results from probability theory. It is also different from that of He et al. and Tan, who use type
theory. It is worth noting that standard finite block length bounds on the rate and error probability are insufficient
to obtain a conclusive moderate deviations result, and new bounds, such as those obtained with the aforementioned
techniques, are necessary.

The organization of the paper is as follows. In Section II, we define the relevant notions and state our result,
Theorems 2.1 and 2.2. Section III-A contains the proof of the direct part, and Section III-B contains the proof of
the converse part.

Notation: Boldface letters denote vectors; regular letters with subscripts denote individual elements of vectors.
Furthermore, capital letters represent random variables and lowercase letters denote individual realizations of the
corresponding random variable. Throughout the paper, all logarithms are base-e. Given a finite set $\mathcal{X}$, $\mathcal{P}(\mathcal{X})$ denotes
the set of all probability distributions defined on $\mathcal{X}$. Similarly, given two finite sets $\mathcal{X}$ and $\mathcal{Y}$, $\mathcal{P}(\mathcal{Y}|\mathcal{X})$ denotes the
set of all stochastic matrices from $\mathcal{X}$ to $\mathcal{Y}$. Given any finite set $\mathcal{X}$ and for any $P \in \mathcal{P}(\mathcal{X})$, $\mathcal{S}(P)$ denotes the support
of $P$. The sets $\mathbb{R}$, $\mathbb{R}_+$ and $\mathbb{R}^+$ denote real, non-negative real and positive real numbers, respectively. The set $\mathbb{Z}^+$
denotes positive integers. We follow the notation of Csiszár-Körner [21] for the fundamental information-theoretic
notions.



II. Definitions, Statement of the Main Result and the Auxiliary Results 
A. Definitions 

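Given $W \in \mathcal{P}(\mathcal{Y}|\mathcal{X})$, $(f, \varphi)$ denotes a code, with $f(\cdot)$ (resp. $\varphi(\cdot)$) being the encoding (resp. decoding) function.
For a given code $(f, \varphi)$, $e_m(W, f, \varphi)$ denotes the conditional probability of error for message $m$, $e(W, f, \varphi)$ denotes
the maximal probability of error and $\bar{e}(W, f, \varphi)$ denotes the average probability of error. $E_0 : \mathbb{R}_+ \times \mathcal{P}(\mathcal{X}) \to \mathbb{R}$
denotes the function defined as

\[ E_0(\rho, P) := -\log \sum_{y \in \mathcal{Y}} \left( \sum_{x \in \mathcal{X}} P(x) W(y|x)^{\frac{1}{1+\rho}} \right)^{1+\rho}, \]

for all $P \in \mathcal{P}(\mathcal{X})$ and $\rho \in \mathbb{R}_+$ (cf. [22, eq. (5.6.14)]). For any $R \in \mathbb{R}_+$, the random coding and sphere packing
exponents, $E_r(R, W)$ and $E_{SP}(R, W)$, are defined as

\[ E_r(R, W) := \max_{P \in \mathcal{P}(\mathcal{X})} \max_{0 \le \rho \le 1} \left\{ -\rho R + E_0(\rho, P) \right\}, \tag{2} \]

and

\[ E_{SP}(R, W) := \max_{P \in \mathcal{P}(\mathcal{X})} \sup_{\rho \ge 0} \left\{ -\rho R + E_0(\rho, P) \right\}, \tag{3} \]

respectively. The following is a well-known result (e.g., [20, Theorem 18], [21, Ex. 2.5.23]):

\[ E_{SP}(R, W) = \max_{P \in \mathcal{P}(\mathcal{X})} \; \min_{V \in \mathcal{P}(\mathcal{Y}|\mathcal{X}) : I(P;V) \le R} D(V \| W | P), \tag{4} \]

for any $R \in \mathbb{R}_+$.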
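To make the definitions of $E_0$ and the random coding exponent concrete, the following is a minimal numerical sketch. It assumes a channel stored as a nested list `W[x][y]` and a hypothetical binary symmetric channel with crossover probability 0.11; it maximizes over $\rho$ on a grid for one fixed input distribution only, so it is a lower bound on the quantity in (2) rather than the full maximization over $P$. All names are illustrative.

```python
import math

def e0(rho, P, W):
    """Gallager's E_0(rho, P) in nats; W[x][y] is the transition probability W(y|x)."""
    total = 0.0
    for y in range(len(W[0])):
        inner = sum(P[x] * W[x][y] ** (1.0 / (1.0 + rho))
                    for x in range(len(P)) if W[x][y] > 0)
        total += inner ** (1.0 + rho)
    return -math.log(total)

def random_coding_exponent(R, P, W, grid=2000):
    """Grid search over rho in [0, 1] of E_0(rho, P) - rho * R for a fixed input P."""
    return max(e0(k / grid, P, W) - (k / grid) * R for k in range(grid + 1))

# Hypothetical example channel: binary symmetric channel, crossover 0.11.
p = 0.11
W_bsc = [[1 - p, p], [p, 1 - p]]
P_unif = [0.5, 0.5]
C = math.log(2) + p * math.log(p) + (1 - p) * math.log(1 - p)  # capacity in nats

for frac in (0.5, 0.8, 1.0):
    print(frac, random_coding_exponent(frac * C, P_unif, W_bsc))
```

The exponent is positive below capacity and shrinks to zero as the rate approaches $C$, which is the regime the main result quantifies.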

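Given any $W \in \mathcal{P}(\mathcal{Y}|\mathcal{X})$ and $P \in \mathcal{P}(\mathcal{X})$, we define

\[ \sigma^2(P, W) := \mathrm{Var}_{P \times W} \left[ \log \frac{W(Y|X)}{\sum_{z \in \mathcal{X}} P(z) W(Y|z)} \right]. \tag{5} \]

Using (5), we further define²

\[ \sigma^2(W) := \min_{P \in \mathcal{P}(\mathcal{X}) : I(P;W) = C} \sigma^2(P, W), \tag{6} \]

and let $P(W)$ denote some element of $\mathcal{P}(\mathcal{X})$ that achieves the minimum in (6).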
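As an illustration of (5), the sketch below computes the mutual information and $\sigma^2(P, W)$ for a channel given as a nested list `W[x][y]`. The function names and the binary symmetric example are assumptions made for illustration. For that channel with uniform input the information density takes only two values, so $\sigma^2(P, W)$ reduces to $p(1-p)\log^2((1-p)/p)$, which gives a convenient cross-check.

```python
import math

def output_distribution(P, W):
    """Output marginal Q(y) = sum_x P(x) W(y|x)."""
    return [sum(P[x] * W[x][y] for x in range(len(P))) for y in range(len(W[0]))]

def mutual_information(P, W):
    """I(P; W) in nats."""
    Q = output_distribution(P, W)
    return sum(P[x] * W[x][y] * math.log(W[x][y] / Q[y])
               for x in range(len(P)) for y in range(len(Q))
               if P[x] > 0 and W[x][y] > 0)

def dispersion(P, W):
    """Variance of log(W(Y|X) / Q(Y)) under the joint law P x W, as in (5)."""
    Q = output_distribution(P, W)
    second_moment = sum(P[x] * W[x][y] * math.log(W[x][y] / Q[y]) ** 2
                        for x in range(len(P)) for y in range(len(Q))
                        if P[x] > 0 and W[x][y] > 0)
    return second_moment - mutual_information(P, W) ** 2

# Hypothetical example: binary symmetric channel with crossover 0.11, uniform input.
p = 0.11
W_bsc = [[1 - p, p], [p, 1 - p]]
print(mutual_information([0.5, 0.5], W_bsc), dispersion([0.5, 0.5], W_bsc))
```

For a channel with several capacity-achieving inputs, (6) would additionally require minimizing `dispersion` over that set; the sketch does not attempt this.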

B. Statement of the Main Result 

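The next two theorems comprise our main result.

Theorem 2.1: For any $W \in \mathcal{P}(\mathcal{Y}|\mathcal{X})$ with $\sigma^2(W) > 0$³, for any sequence of real numbers $\{\epsilon_n\}_{n \ge 1}$ satisfying

\[ \text{(i)}\ \epsilon_n \to 0 \text{ as } n \to \infty, \qquad \text{(ii)}\ \epsilon_n \sqrt{n} \to \infty \text{ as } n \to \infty, \tag{7} \]

there exists a sequence of codes $\{(f_n, \varphi_n)\}_{n \ge 1}$ that satisfies $R_n := \frac{\log |\mathcal{M}_n|}{n} \ge C - \epsilon_n$, for all $n \in \mathbb{Z}^+$, and

\[ \limsup_{n \to \infty} \frac{1}{n \epsilon_n^2} \log e(W, f_n, \varphi_n) \le -\frac{1}{2\sigma^2(W)}. \tag{8} \]

Theorem 2.2: For any $W \in \mathcal{P}(\mathcal{Y}|\mathcal{X})$ with $\sigma^2(W) > 0$, for any sequence of real numbers $\{\epsilon_n\}_{n \ge 1}$ satisfying (7)
and for any sequence of codes $\{(f_n, \varphi_n)\}_{n \ge 1}$ satisfying $R_n = \frac{\log |\mathcal{M}_n|}{n} \ge C - \epsilon_n$, we have

\[ \liminf_{n \to \infty} \frac{1}{n \epsilon_n^2} \log \bar{e}(W, f_n, \varphi_n) \ge -\frac{1}{2\sigma^2(W)}. \tag{9} \]

Remark 2.1: Polyanskiy and Verdú [34] show that the assumption $\sigma^2(W) > 0$ is necessary in order for

\[ -\frac{1}{n \epsilon_n^2} \log e(W, f_n, \varphi_n) \]

to have a finite limit. If $\sigma^2(W) = 0$, then we conjecture that

\[ -\frac{1}{n \epsilon_n} \log e(W, f_n, \varphi_n) \]

has a finite limit (see [34, Theorem 4]). ◇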
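As a concrete instance of conditions (7), the following worked example takes a polynomially shrinking gap to capacity. The exponent $a$ is an illustrative parameter introduced here and is not notation from the paper.

```latex
\epsilon_n = n^{-a}, \quad 0 < a < \tfrac{1}{2}
\;\Longrightarrow\;
\epsilon_n \to 0, \qquad
\epsilon_n \sqrt{n} = n^{\frac{1}{2} - a} \to \infty, \qquad
n \epsilon_n^2 = n^{1 - 2a} .
```

Under Theorems 2.1 and 2.2, the logarithm of the best achievable error probability at rates $C - n^{-a}$ then behaves as $-\frac{n^{1-2a}}{2\sigma^2(W)}(1 + o(1))$, so the error probability vanishes faster than any constant but slower than exponentially in $n$.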

2 The minimum is well-defined owing to the continuity of $\sigma^2(\cdot, W)$ (cf. Remark 2.3).
3 Since $\sigma^2(W) > 0$ implies that $C > R_\infty(W) \ge 0$ (e.g., [22, pg. 160]), we have $C > 0$.

Remark 2.2: Our achievability proof follows from Gallager's random coding bound (e.g., [22, Corollary 2, pg.
140]), which states that for any rate $R$ and block length $n$, there exists an $(n, R)$ code $(f, \varphi)$ such that

\[ e(W, f, \varphi) \le 4 e^{-n E_r(R, W)}. \tag{10} \]

Since $n$ and $R$ are arbitrary, we can let $R = C - \epsilon_n$ and approximate $E_r(\cdot, W)$ around $C$ via a Taylor series to
obtain Theorem 2.1. This line of reasoning is made rigorous in Section III-A.

The achievability argument is deceptively simple in that it obscures issues that must be confronted when proving
the converse. To prove the converse, we would like to show that for any $\epsilon_n$ satisfying the hypothesis of the theorem
and any $\alpha > 1$, there exist sequences $\beta_n$ and $\gamma_n$ satisfying

\[ \frac{\beta_n}{\epsilon_n} \to 0, \tag{11} \]
\[ \frac{1}{n \epsilon_n^2} \log \gamma_n \to 0, \tag{12} \]

such that for all sufficiently large $n$ and all $(n, C - \epsilon_n)$ codes $(f, \varphi)$, we have

\[ e(W, f, \varphi) \ge \gamma_n e^{-n \alpha E_{SP}(C - \epsilon_n - \beta_n, W)}. \tag{13} \]

If one could prove such a bound, then one could obtain Theorem 2.2 by expanding $E_{SP}(R, W)$ as a Taylor series
around $R = C$ and taking the appropriate limit.

But it is not clear whether a bound like (13) holds. The authors' recent refinement of the classical sphere-packing
bound [37, Theorem 2.1] establishes that for all $\epsilon > 0$, all fixed rates $R$ below capacity, and all sufficiently large
$n$, any constant composition⁴ $(n, R)$ code $(f, \varphi)$ satisfies

e(W, M>^& exp {-nE SP ( R - (i + ^V" W ) } . (14) 



Moreover, the $n$-dependence on the right side is essentially the best possible for a fixed $R$ [39].

Although the rate backoff in this bound clearly satisfies (11), whether the pre-factor satisfies (12) hinges on the
$R$-dependence of $K(R)$. This dependence is not currently known, but it can be postulated via the following reasoning.
In Strassen's regime, in which the rate approaches capacity at a speed of $1/\sqrt{n}$, the error probability is asymptotically
constant [2], and a Taylor series expansion of the sphere-packing exponent shows that the exponential factor in
(14) is also asymptotically constant in this regime. If we assume that (14) holds in this regime, then it follows that
the pre-factor must also be asymptotically constant, which suggests that $K(R)$ might behave as $1/(C - R)$. If this
is true, then the pre-factor would satisfy (12), so (13) would hold.

We show that (13) indeed holds, although our proof does not involve characterizing how $K(R)$ varies with $R$.⁵
Instead we prove (13) directly using a particular set of classical information theory results, which do not appear
to have been used in combination before, to prove a version of the sphere-packing exponent that is especially
tight at finite block lengths and rates near capacity. The fact that our proof is similar to existing derivations of
the sphere-packing exponent and uses well-known ingredients might give the impression that the result is routine.
In fact, the required bounds are quite delicate, as the above discussion illustrates, and many conceptually similar
approaches to proving the sphere-packing exponent fail to give a conclusive moderate deviations result. ◇



C. Auxiliary Results 

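Lemma 2.1: Given any $W \in \mathcal{P}(\mathcal{Y}|\mathcal{X})$ with no all-zero column, $E_0(\rho, P)$ possesses the following properties:

1) Given any $P \in \mathcal{P}(\mathcal{X})$, $E_0(\rho, P)$ is concave in $\rho \in \mathbb{R}_+$.

2) Given any $P \in \mathcal{P}(\mathcal{X})$,

\[ \left. \frac{\partial E_0(\rho, P)}{\partial \rho} \right|_{\rho = 0} = I(P; W). \tag{15} \]

3) Given any $P \in \mathcal{P}(\mathcal{X})$,

\[ \left. \frac{\partial^2 E_0(\rho, P)}{\partial \rho^2} \right|_{\rho = 0} = -\sigma^2(P, W). \tag{16} \]

4) Given any $P \in \mathcal{P}(\mathcal{X})$,

\[ \frac{\partial E_0(\rho, P)}{\partial \rho} \le I(P; W), \quad \forall \rho \in \mathbb{R}_+. \tag{17} \]

5) $\frac{\partial E_0(\rho, P)}{\partial \rho}$ is continuous over $(\rho, P) \in \mathbb{R}_+ \times \mathcal{P}(\mathcal{X})$.

6) $\frac{\partial^2 E_0(\rho, P)}{\partial \rho^2}$ is continuous over $(\rho, P) \in \mathbb{R}_+ \times \mathcal{P}(\mathcal{X})$.

7) $\frac{\partial^3 E_0(\rho, P)}{\partial \rho^3}$ is continuous over $(\rho, P) \in \mathbb{R}_+ \times \mathcal{P}(\mathcal{X})$.

Proof: The proof is given in Appendix A. ∎

Remark 2.3: Note that for any given $W \in \mathcal{P}(\mathcal{Y}|\mathcal{X})$, $\sigma^2(\cdot, W)$ is continuous on $\mathcal{P}(\mathcal{X})$, owing to items 3) and 6)
of Lemma 2.1. ◇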
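Items 2) and 3) of Lemma 2.1 can be sanity-checked numerically. The sketch below is an illustration under stated assumptions: it uses a hypothetical binary symmetric channel with crossover probability 0.11 and uniform input, for which $E_0$ has the closed form $\rho \log 2 - (1+\rho)\log\big(p^{1/(1+\rho)} + (1-p)^{1/(1+\rho)}\big)$, and it approximates the derivatives at $\rho = 0$ with one-sided finite differences.

```python
import math

p = 0.11  # hypothetical crossover probability of a binary symmetric channel

def e0_bsc(rho):
    """E_0(rho, P) in nats for the BSC with uniform input P."""
    s = 1.0 / (1.0 + rho)
    return rho * math.log(2) - (1.0 + rho) * math.log(p ** s + (1 - p) ** s)

# Closed forms for this channel and input distribution.
mutual_info = math.log(2) + p * math.log(p) + (1 - p) * math.log(1 - p)
sigma2 = p * (1 - p) * math.log((1 - p) / p) ** 2

h = 1e-3
# One-sided second-order finite differences at rho = 0 (only rho >= 0 is used).
d1 = (-3 * e0_bsc(0) + 4 * e0_bsc(h) - e0_bsc(2 * h)) / (2 * h)
d2 = (2 * e0_bsc(0) - 5 * e0_bsc(h) + 4 * e0_bsc(2 * h) - e0_bsc(3 * h)) / h ** 2

print(d1, mutual_info)  # item 2): the two values should nearly agree
print(d2, -sigma2)      # item 3): the two values should nearly agree
```

The agreement is only up to the discretization error of the difference quotients, so this is a plausibility check and not a substitute for the proof in Appendix A.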

4 If the channel is symmetric, then the constant composition assumption can be dropped (cf. [38]).
5 Determining how $K(R)$ varies with $R$ is an interesting subject for future work.

III. Proof of the Results

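A. Proof of Theorem 2.1

Let $W \in \mathcal{P}(\mathcal{Y}|\mathcal{X})$ be an arbitrary stochastic matrix satisfying the conditions stated in the theorem. Without
loss of generality, suppose that $W$ has no all-zero columns. Further, let $\{\epsilon_n\}_{n \ge 1}$ be an arbitrary sequence of real
numbers satisfying (7). By (7) and the fact that $C > 0$, we have

\[ C - \epsilon_n > 0, \tag{18} \]

for all sufficiently large $n$. Next, fix such an $n$. Gallager's random coding bound (e.g., [22, Corollary 2, pg. 140])
implies that there exists $(f_n, \varphi_n)$ such that $R_n \ge \tilde{R}_n := C - \epsilon_n$ and

\[ e(W, f_n, \varphi_n) \le 4 \exp\left\{ -n \max_{0 \le \rho \le 1} \left\{ E_0(\rho, P) - \rho \tilde{R}_n \right\} \right\}, \tag{19} \]

for all $P \in \mathcal{P}(\mathcal{X})$. Therefore, (19) implies the existence of a sequence of codes $\{(f_n, \varphi_n)\}_{n \ge 1}$ such that for all $n \in \mathbb{Z}^+$,
$R_n \ge C - \epsilon_n$ and

\[ \frac{1}{n \epsilon_n^2} \log e(W, f_n, \varphi_n) \le \frac{\log 4}{n \epsilon_n^2} - \frac{1}{\epsilon_n^2} \max_{0 \le \rho \le 1} \left\{ E_0(\rho, P) - \rho \tilde{R}_n \right\}, \tag{20} \]

for all sufficiently large $n$ and any $P \in \mathcal{P}(\mathcal{X})$. Hence, it suffices to prove that (8) holds for this particular sequence
of codes in order to conclude the result.

Using Taylor's Theorem, along with (15) and (16) (cf. items 2) and 3) of Lemma 2.1), for any $\rho \in \mathbb{R}_+$, we have

\[ E_0(\rho, P(W)) = \rho C - \frac{\rho^2}{2} \sigma^2(W) + \frac{\rho^3}{6} \left. \frac{\partial^3 E_0(\rho, P(W))}{\partial \rho^3} \right|_{\rho = \bar{\rho}}, \tag{21} \]

for some $\bar{\rho} \in [0, \rho]$. Next, let $\rho_n := \frac{\epsilon_n}{\sigma^2(W)}$, for all $n \in \mathbb{Z}^+$. Then (21) yields

\[ \max_{0 \le \rho \le 1} \left\{ E_0(\rho, P(W)) - \rho \tilde{R}_n \right\} \ge \frac{\epsilon_n^2}{2\sigma^2(W)} + \frac{\epsilon_n^3}{6\sigma^6(W)} \left. \frac{\partial^3 E_0(\rho, P(W))}{\partial \rho^3} \right|_{\rho = \bar{\rho}_n}, \tag{22} \]

for all sufficiently large $n$ and for some $\bar{\rho}_n \in [0, \rho_n]$.

Next, note that $\rho_n \le 1$ for all sufficiently large $n$, since $\lim_{n \to \infty} \epsilon_n = 0$ (cf. (i) of (7)) and $\sigma^2(W) > 0$. We
define

\[ M := \max_{(\rho, P) \in [0,1] \times \mathcal{P}(\mathcal{X})} \left| \frac{\partial^3 E_0(\rho, P)}{\partial \rho^3} \right|. \tag{23} \]

Owing to item 7) of Lemma 2.1, the maximum in (23) is well-defined and finite. Therefore, (22) and (23) imply
that

\[ \max_{0 \le \rho \le 1} \left\{ E_0(\rho, P(W)) - \rho \tilde{R}_n \right\} \ge \frac{\epsilon_n^2}{2\sigma^2(W)} - \frac{\epsilon_n^3 M}{6\sigma^6(W)}, \tag{24} \]

for all sufficiently large $n$.

Substituting (24) into (20) yields

\[ \frac{1}{n \epsilon_n^2} \log e(W, f_n, \varphi_n) \le \frac{\log 4}{n \epsilon_n^2} - \frac{1}{2\sigma^2(W)} \left[ 1 - \frac{\epsilon_n M}{3\sigma^4(W)} \right], \tag{25} \]

which, in turn, implies (recall (7) and (23))

\[ \limsup_{n \to \infty} \frac{1}{n \epsilon_n^2} \log e(W, f_n, \varphi_n) \le -\frac{1}{2\sigma^2(W)}, \]

which is (8), and hence we conclude the proof.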
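The quadratic behavior of the random coding exponent near capacity, which drives the proof above, can be observed numerically. The sketch below is an illustration under assumptions: it uses a hypothetical binary symmetric channel with crossover probability 0.11 and uniform input (for which $E_0$ has a closed form), and it maximizes over $\rho \in [0, 1]$ by ternary search, relying on the concavity in item 1) of Lemma 2.1. The ratio of the exponent at rate $C - \epsilon$ to $\epsilon^2$ should approach $1/(2\sigma^2(W))$ as $\epsilon$ shrinks.

```python
import math

p = 0.11  # hypothetical crossover probability of a binary symmetric channel
C = math.log(2) + p * math.log(p) + (1 - p) * math.log(1 - p)  # capacity in nats
sigma2 = p * (1 - p) * math.log((1 - p) / p) ** 2             # dispersion

def e0_bsc(rho):
    """E_0(rho, P) in nats for the BSC with uniform input P."""
    s = 1.0 / (1.0 + rho)
    return rho * math.log(2) - (1.0 + rho) * math.log(p ** s + (1 - p) ** s)

def random_coding_exponent(R, iters=200):
    """Maximize E_0(rho) - rho * R over rho in [0, 1] by ternary search."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if e0_bsc(m1) - m1 * R < e0_bsc(m2) - m2 * R:
            lo = m1
        else:
            hi = m2
    rho = (lo + hi) / 2
    return e0_bsc(rho) - rho * R

# Compare the normalized exponent with the limit 1 / (2 * sigma2).
for eps in (0.1, 0.03, 0.01):
    print(eps, random_coding_exponent(C - eps) / eps ** 2, 1 / (2 * sigma2))
```

The residual gap between the two printed columns reflects the third-order term controlled by $M$ in (23) to (25).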
B. Proof of Theorem 2.2

Let $W$ and $\{\epsilon_n\}_{n \ge 1}$ be as in Section III-A. Further, let $\{(f_n, \varphi_n)\}_{n \ge 1}$ be an arbitrary sequence of codes with
$\frac{\log |\mathcal{M}_n|}{n} =: R_n \ge C - \epsilon_n$, for all $n \in \mathbb{Z}^+$. Observe that owing to standard arguments used to switch from the maximum
to average error probability (e.g., [18, eq. (4.41)]), it is sufficient to show the conclusion for the maximum error
probability, i.e.,

\[ \liminf_{n \to \infty} \frac{1}{n \epsilon_n^2} \log e(W, f_n, \varphi_n) \ge -\frac{1}{2\sigma^2(W)}, \tag{26} \]

in order to prove (9). By similar reasoning [21, pg. 171], we can assume that the code is constant composition.

Next, we briefly outline the rest of the proof, which consists of three steps. The first step is to prove a strong
converse theorem, Lemma 3.1, tailored to the particular situation at hand. The second step is to use Lemma 3.1
and "change of measure" to prove (13) (cf. Remark 2.2). The final step is to approximate the exponent in (13) via
a Taylor series to conclude the result.

Remark 3.1: Lemma 3.1, which could be of independent interest, is derived from Wolfowitz's converse to the
channel coding theorem [40]. Although our version requires that the code be constant composition, an assumption
not required by Wolfowitz, it shows that the error probability must be near unity if the rate exceeds the mutual
information induced by the code. Wolfowitz requires the rate to exceed capacity. ◇

Remark 3.2: One of the well-known change of measure arguments is Marton's [41, eq. (12)]. Although Marton
originally applied it to rate distortion, the application to channel coding is obvious. It does not seem sufficient to
prove (13), however. Instead, we use a change of measure argument based on the log-sum inequality, given by
Csiszár and Körner [21, pg. 167]. ◇

Define the constant $A$ as follows:

\[ A := \max_{(P, V) \in \mathcal{P}(\mathcal{X}) \times \mathcal{P}(\mathcal{Y}|\mathcal{X})} \mathrm{Var}_{P \times V} \left[ \log \frac{V(Y|X)}{Q(Y)} \right] + 1, \tag{27} \]

where $Q(y) := \sum_{x \in \mathcal{X}} P(x) V(y|x)$, $\forall y \in \mathcal{Y}$. Note that, since the cost function is continuous in the optimization
variable and we work with finite alphabets, the maximum in (27) is well-defined and finite.

Lemma 3.1: (Strong Converse). Let $(f, \varphi)$ be an arbitrary constant composition code with block length $n$, common
type $P$, and rate $R > 0$. Let $V \in \mathcal{P}(\mathcal{Y}|\mathcal{X})$ be an arbitrary stochastic matrix satisfying $I(P; V) \le R - 2\delta$, for some
$\delta > 0$. Then, we have

\[ \bar{e}(V, f, \varphi) \ge 1 - \frac{A}{n \delta^2} - e^{-n\delta}, \tag{28} \]

where $A$ is defined in (27).

Proof: The proof is given in Appendix B. ∎

Next, fix some $0 < \gamma < 1/2$. Let $\psi \in \mathbb{R}^+$ be defined as

\[ \psi^2 := \frac{2A}{\gamma}. \tag{29} \]

Note that for all sufficiently large $n$,

\[ 0 < C - \left( \epsilon_n + \frac{2\psi}{\sqrt{n}} \right), \tag{30} \]

\[ e^{-\psi \sqrt{n}} \le \gamma / 2. \tag{31} \]

As a direct consequence of the Strong Converse lemma (with the choice of $\delta = \psi / \sqrt{n}$), for any $V \in \mathcal{P}(\mathcal{Y}|\mathcal{X})$
satisfying $I(P_n; V) \le R_n - \frac{2\psi}{\sqrt{n}}$, we have

\[ \exists\, m \in \mathcal{M} = \{1, \ldots, \lceil 2^{n R_n} \rceil\}, \text{ s.t. } e_m(V, f_n, \varphi_n) \ge 1 - \gamma, \tag{32} \]

for all sufficiently large $n$ such that (30) and (31) hold. Note that $n$ does not depend on the specific choice of $V$.
Fix a sufficiently large $n$ such that (30) and (31) hold.
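The role of (29) and (31) can be checked by substituting $\delta = \psi/\sqrt{n}$ into the strong converse bound (28). The short computation below is a sketch of that step and uses only quantities already defined above.

```latex
\frac{A}{n\delta^2}\bigg|_{\delta = \psi/\sqrt{n}} = \frac{A}{\psi^2} = \frac{\gamma}{2},
\qquad
e^{-n\delta}\Big|_{\delta = \psi/\sqrt{n}} = e^{-\psi\sqrt{n}} \le \frac{\gamma}{2}
\;\Longrightarrow\;
\bar{e}(V, f_n, \varphi_n) \ge 1 - \frac{\gamma}{2} - \frac{\gamma}{2} = 1 - \gamma .
```

Since the largest conditional error probability is at least the average, some message $m$ must then satisfy $e_m(V, f_n, \varphi_n) \ge 1 - \gamma$, which is the statement of (32).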

Lemma 3.2: (Change of Measure). Let $(f_n, \varphi_n)$ be an arbitrary constant composition code with block length $n$ and
common type $P_n$. Then

\[ e(W, f_n, \varphi_n) \ge \exp\left\{ -n \left[ \frac{1}{1 - \gamma} \min_{V \in \mathcal{P}(\mathcal{Y}|\mathcal{X}) : I(P_n; V) \le R_n - \frac{2\psi}{\sqrt{n}}} D(V \| W | P_n) + \frac{h(\gamma)}{n(1 - \gamma)} \right] \right\}, \tag{33} \]

for all sufficiently large $n$ such that (30) and (31) hold, where $h(\cdot)$ is the binary entropy function, i.e., $h(p) :=
p \log(1/p) + (1 - p) \log(1/(1 - p))$, $\forall p \in [0, 1]$.

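Proof: The argument is due to Csiszár and Körner (e.g., [21, pg. 167]), and we state it for the sake of
completeness. Fix $n$ and let $V$ be any channel such that

\[ I(P_n; V) \le R_n - \frac{2\psi}{\sqrt{n}}. \]

By the log-sum inequality (e.g., [21, pg. 48]), for any message $m$, we have

\[ V^n(\varphi^{-1}(m) | \mathbf{x}^n(m)) \log \frac{V^n(\varphi^{-1}(m) | \mathbf{x}^n(m))}{W^n(\varphi^{-1}(m) | \mathbf{x}^n(m))} + V^n((\varphi^{-1}(m))^c | \mathbf{x}^n(m)) \log \frac{V^n((\varphi^{-1}(m))^c | \mathbf{x}^n(m))}{W^n((\varphi^{-1}(m))^c | \mathbf{x}^n(m))} \le D(V^n \| W^n | \mathbf{x}^n(m)), \]

where $\varphi^{-1}(m)$ denotes the decoding region for the $m$-th message and $(\varphi^{-1}(m))^c$ denotes its complement. This,
in turn, implies that

\[ V^n((\varphi^{-1}(m))^c | \mathbf{x}^n(m)) \log \frac{1}{W^n((\varphi^{-1}(m))^c | \mathbf{x}^n(m))} \le D(V^n \| W^n | \mathbf{x}^n(m)) + h\big(V^n(\varphi^{-1}(m) | \mathbf{x}^n(m))\big). \tag{34} \]

Applying this inequality to a message satisfying (32) gives (33). ∎

Equation (33), along with (4), implies that

\[ e(W, f_n, \varphi_n) \ge \exp\left\{ -n \left[ \frac{E_{SP}(C - \delta_n, W)}{1 - \gamma} + \frac{h(\gamma)}{n(1 - \gamma)} \right] \right\}, \tag{35} \]

where

\[ \delta_n := \epsilon_n \left[ 1 + \frac{2\psi}{\epsilon_n \sqrt{n}} \right], \]

for all $n \in \mathbb{Z}^+$. Note that this establishes (13). We define

\[ \alpha_n := 1 + \frac{2\psi}{\epsilon_n \sqrt{n}}, \quad \forall n \in \mathbb{Z}^+, \]

and note that since $\epsilon_n \sqrt{n} \to \infty$ as $n \to \infty$ (cf. item (ii) of (7)), $\alpha_n \to 1$ as $n \to \infty$. Therefore, $\delta_n \to 0$ as $n \to \infty$
(cf. item (i) of (7)).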

The third and final step of the proof is to approximate the exponent on the right side of (35). To this end, first
note that if the rate is above the critical rate⁶, i.e., $R \ge R_{cr}$, then $E_{SP}(R, W) = E_r(R, W)$ (e.g., [22, pg. 160]),
which, in turn, implies that

\[ E_{SP}(R, W) = E_r(R, W) = \max_{P \in \mathcal{P}(\mathcal{X})} \max_{0 \le \rho \le 1} \left\{ -\rho R + E_0(\rho, P) \right\}, \tag{36} \]

by recalling (2).

Further, since $\sigma^2(W) > 0$, one can infer that (e.g., [22, pg. 160]) $R_{cr} < C$ and hence for all sufficiently large $n$,
$C - \delta_n \ge R_{cr}$. This observation, coupled with (36), ensures that for all sufficiently large $n$, we have

\[ E_{SP}(C - \delta_n, W) = E_r(C - \delta_n, W) = \max_{P \in \mathcal{P}(\mathcal{X})} \max_{0 \le \rho \le 1} \left\{ -\rho [C - \delta_n] + E_0(\rho, P) \right\}. \tag{37} \]

Proposition 3.1: (Sphere-packing exponent around $C$)

\[ \limsup_{n \to \infty} \frac{E_{SP}(C - \delta_n, W)}{\delta_n^2} \le \frac{1}{2\sigma^2(W)}. \tag{38} \]

Proof: The proof is given in Appendix C. ∎

6 See [22, pg. 160] for the definition of $R_{cr}$.
Equipped with Proposition 3.1, we conclude the proof as follows. Recall that $\delta_n = \epsilon_n \alpha_n$, where $\alpha_n > 0$ for all
$n \in \mathbb{Z}^+$ and $\alpha_n \to 1$ as $n \to \infty$. Hence,

\[ \limsup_{n \to \infty} \frac{E_{SP}(C - \delta_n, W)}{\epsilon_n^2} = \limsup_{n \to \infty} \frac{E_{SP}(C - \delta_n, W)}{\delta_n^2} \le \frac{1}{2\sigma^2(W)}. \tag{39} \]

Since $\lim_{n \to \infty} n \epsilon_n^2 = \infty$ (cf. item (ii) of (7)), (35), (37) and (39) imply that

\[ \liminf_{n \to \infty} \frac{1}{n \epsilon_n^2} \log e(W, f_n, \varphi_n) \ge -\frac{1}{2\sigma^2(W)} \cdot \frac{1}{1 - \gamma}. \tag{40} \]

Since $0 < \gamma < 1/2$ is arbitrary, letting $\gamma \to 0$ in the right side of (40) yields (26), which was to be shown.

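Appendix A
Proof of Lemma 2.1

Consider any $W \in \mathcal{P}(\mathcal{Y}|\mathcal{X})$. For all $y \in \mathcal{Y}$, define

\[ \mathcal{X}_y := \{ x \in \mathcal{X} : W(y|x) > 0 \}. \tag{41} \]

Observe that owing to the no all-zero column assumption on $W$ and (41), for all $y \in \mathcal{Y}$, $\mathcal{X}_y \ne \emptyset$. Moreover, for
any $P \in \mathcal{P}(\mathcal{X})$, there exists $y \in \mathcal{Y}$ with $\mathcal{X}_y \cap \mathcal{S}(P) \ne \emptyset$.
For all $y \in \mathcal{Y}$, define

\[ f_y : \mathbb{R}_+ \times \mathcal{P}(\mathcal{X}) \to \mathbb{R}_+, \text{ s.t. } f_y(\rho, P) := \sum_{x \in \mathcal{X}_y} P(x) W(y|x)^{\frac{1}{1+\rho}}, \quad \forall (\rho, P) \in \mathbb{R}_+ \times \mathcal{P}(\mathcal{X}). \tag{42} \]

Evidently $f_y(\cdot, \cdot)$ is continuous on $\mathbb{R}_+ \times \mathcal{P}(\mathcal{X})$. Also, straightforward calculation reveals that

\[ \frac{\partial f_y(\rho, P)}{\partial \rho} = -\frac{1}{(1+\rho)^2} \sum_{x \in \mathcal{X}_y} P(x) W(y|x)^{\frac{1}{1+\rho}} \log W(y|x), \tag{43} \]

\[ \frac{\partial^2 f_y(\rho, P)}{\partial \rho^2} = \frac{1}{(1+\rho)^3} \sum_{x \in \mathcal{X}_y} P(x) W(y|x)^{\frac{1}{1+\rho}} \log W(y|x) \left[ 2 + \frac{\log W(y|x)}{1+\rho} \right], \tag{44} \]

\[ \frac{\partial^3 f_y(\rho, P)}{\partial \rho^3} = -\frac{1}{(1+\rho)^4} \sum_{x \in \mathcal{X}_y} P(x) W(y|x)^{\frac{1}{1+\rho}} \log W(y|x) \left[ 6 + \frac{6 \log W(y|x)}{1+\rho} + \frac{(\log W(y|x))^2}{(1+\rho)^2} \right]. \tag{45} \]

Further,

\[ \forall P \in \mathcal{P}(\mathcal{X}) \text{ s.t. } \mathcal{S}(P) \cap \mathcal{X}_y = \emptyset, \quad f_y(\cdot, P) = 0. \tag{46} \]

Equation (46), coupled with (43), (44) and (45), implies that $\frac{\partial f_y(\rho, P)}{\partial \rho}$, $\frac{\partial^2 f_y(\rho, P)}{\partial \rho^2}$ and $\frac{\partial^3 f_y(\rho, P)}{\partial \rho^3}$ are continuous for
all $(\rho, P) \in \mathbb{R}_+ \times \mathcal{P}(\mathcal{X})$.
For all $y \in \mathcal{Y}$, define

\[ g_y : \mathbb{R}_+ \times \mathcal{P}(\mathcal{X}) \to \mathbb{R}_+, \text{ s.t. } g_y(\rho, P) := f_y(\rho, P)^{1+\rho}, \tag{47} \]

where $f_y(\cdot, \cdot)$ is defined in (42). It follows that $g_y(\cdot, \cdot)$ is continuous on $\mathbb{R}_+ \times \mathcal{P}(\mathcal{X})$.
Note that

\[ \forall P \in \mathcal{P}(\mathcal{X}) \text{ s.t. } \mathcal{S}(P) \cap \mathcal{X}_y = \emptyset, \quad g_y(\cdot, P) = 0. \tag{48} \]

Consider any $P \in \mathcal{P}(\mathcal{X})$ with $\mathcal{S}(P) \cap \mathcal{X}_y \ne \emptyset$. By noting $g_y(\rho, P) = e^{(1+\rho) \log f_y(\rho, P)}$, one can check that

\[ \frac{\partial g_y(\rho, P)}{\partial \rho} = g_y(\rho, P) \left[ (1+\rho) \frac{\frac{\partial f_y(\rho, P)}{\partial \rho}}{f_y(\rho, P)} + \log f_y(\rho, P) \right], \tag{49} \]

\[ \frac{\partial^2 g_y(\rho, P)}{\partial \rho^2} = \frac{\partial g_y(\rho, P)}{\partial \rho} \left[ (1+\rho) \frac{\frac{\partial f_y(\rho, P)}{\partial \rho}}{f_y(\rho, P)} + \log f_y(\rho, P) \right] + g_y(\rho, P) \left[ 2 \frac{\frac{\partial f_y(\rho, P)}{\partial \rho}}{f_y(\rho, P)} + (1+\rho) \left( \frac{\frac{\partial^2 f_y(\rho, P)}{\partial \rho^2}}{f_y(\rho, P)} - \left( \frac{\frac{\partial f_y(\rho, P)}{\partial \rho}}{f_y(\rho, P)} \right)^2 \right) \right], \tag{50} \]

\[ \begin{aligned} \frac{\partial^3 g_y(\rho, P)}{\partial \rho^3} ={}& \frac{\partial^2 g_y(\rho, P)}{\partial \rho^2} \left[ (1+\rho) \frac{\frac{\partial f_y(\rho, P)}{\partial \rho}}{f_y(\rho, P)} + \log f_y(\rho, P) \right] \\ &+ 2 \frac{\partial g_y(\rho, P)}{\partial \rho} \left[ 2 \frac{\frac{\partial f_y(\rho, P)}{\partial \rho}}{f_y(\rho, P)} + (1+\rho) \left( \frac{\frac{\partial^2 f_y(\rho, P)}{\partial \rho^2}}{f_y(\rho, P)} - \left( \frac{\frac{\partial f_y(\rho, P)}{\partial \rho}}{f_y(\rho, P)} \right)^2 \right) \right] \\ &+ g_y(\rho, P) \left[ 3 \left( \frac{\frac{\partial^2 f_y(\rho, P)}{\partial \rho^2}}{f_y(\rho, P)} - \left( \frac{\frac{\partial f_y(\rho, P)}{\partial \rho}}{f_y(\rho, P)} \right)^2 \right) + (1+\rho) \left( \frac{\frac{\partial^3 f_y(\rho, P)}{\partial \rho^3}}{f_y(\rho, P)} - 3 \frac{\frac{\partial^2 f_y(\rho, P)}{\partial \rho^2} \frac{\partial f_y(\rho, P)}{\partial \rho}}{f_y(\rho, P)^2} + 2 \left( \frac{\frac{\partial f_y(\rho, P)}{\partial \rho}}{f_y(\rho, P)} \right)^3 \right) \right]. \end{aligned} \tag{51} \]

For any $y \in \mathcal{Y}$, define

\[ w_{\min}(y) := \min_{y \in \mathcal{Y}} \min_{x \in \mathcal{X}_y} W(y|x), \tag{52} \]

\[ w_{\max}(y) := \max_{y \in \mathcal{Y}} \max_{x \in \mathcal{X}_y} W(y|x). \tag{53} \]

From (43), by using (52) and (53), we infer that

\[ \frac{\partial f_y(\rho, P)}{\partial \rho} \le \frac{f_y(\rho, P)}{(1+\rho)^2} \log \frac{1}{w_{\min}(y)}, \tag{54} \]

\[ \frac{\partial f_y(\rho, P)}{\partial \rho} \ge \frac{f_y(\rho, P)}{(1+\rho)^2} \log \frac{1}{w_{\max}(y)}. \tag{55} \]

Consider any sequence $\{(\rho_k, P_k)\}_{k \ge 1}$ in $\mathbb{R}_+ \times \mathcal{P}(\mathcal{X})$ with $\mathcal{S}(P_k) \cap \mathcal{X}_y \ne \emptyset$ for all $k \in \mathbb{Z}^+$ and $(\rho_k, P_k) \to (\rho_0, P_0)$
for some $(\rho_0, P_0) \in \mathbb{R}_+ \times \mathcal{P}(\mathcal{X})$ with $\mathcal{S}(P_0) \cap \mathcal{X}_y = \emptyset$. Using (54) and (55), we deduce that

\[ \frac{1}{(1+\rho_0)^2} \log \frac{1}{w_{\max}(y)} \le \liminf_{k \to \infty} \frac{\left. \frac{\partial f_y(\rho, P_k)}{\partial \rho} \right|_{\rho = \rho_k}}{f_y(\rho_k, P_k)} \le \limsup_{k \to \infty} \frac{\left. \frac{\partial f_y(\rho, P_k)}{\partial \rho} \right|_{\rho = \rho_k}}{f_y(\rho_k, P_k)} \le \frac{1}{(1+\rho_0)^2} \log \frac{1}{w_{\min}(y)}. \tag{56} \]

Note that (56) is evident if $\mathcal{S}(P_0) \cap \mathcal{X}_y \ne \emptyset$.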

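Lemma A.1: Given any $y \in \mathcal{Y}$, $\frac{\partial g_y(\rho, P)}{\partial \rho}$ is continuous for all $(\rho, P) \in \mathbb{R}_+ \times \mathcal{P}(\mathcal{X})$.

Proof: Fix any $y \in \mathcal{Y}$. Consider any $(\rho_0, P_0) \in \mathbb{R}_+ \times \mathcal{P}(\mathcal{X})$.
Note that if $\mathcal{S}(P_0) \cap \mathcal{X}_y \ne \emptyset$, then by recalling the continuity of $f_y(\cdot, \cdot)$, $\frac{\partial f_y(\rho, P)}{\partial \rho}$ and $g_y(\cdot, \cdot)$, (49) ensures that
$\frac{\partial g_y(\rho, P)}{\partial \rho}$ is continuous at $(\rho_0, P_0)$. Hence, suppose $\mathcal{S}(P_0) \cap \mathcal{X}_y = \emptyset$.

Let $\{(\rho_k, P_k)\}_{k \ge 1}$ be arbitrary with $\lim_{k \to \infty} (\rho_k, P_k) = (\rho_0, P_0)$. Observe that (48), along with (42) and (47),
ensures that

\[ \left. \frac{\partial g_y(\rho, P_k)}{\partial \rho} \right|_{\rho = \rho_k} = 0, \text{ if } \mathcal{S}(P_k) \cap \mathcal{X}_y = \emptyset. \tag{57} \]

Consider any subsequence $\{(\rho_{k_n}, P_{k_n})\}_{n \ge 1}$. Now, if all but a finite number of $P_{k_n}$ satisfy $\mathcal{S}(P_{k_n}) \cap \mathcal{X}_y = \emptyset$, then

\[ \lim_{n \to \infty} \left. \frac{\partial g_y(\rho, P_{k_n})}{\partial \rho} \right|_{\rho = \rho_{k_n}} = 0, \tag{58} \]

owing to (57). Suppose this is not the case. One can verify⁷ that

\[ \lim_{n \to \infty} \left. \frac{\partial g_y(\rho, P_{k_n})}{\partial \rho} \right|_{\rho = \rho_{k_n}} = 0, \tag{59} \]

by using the continuity of $f_y(\cdot, \cdot)$ and $g_y(\cdot, \cdot)$, along with (46), (48), (49) and (56).

7 Passing to a further subsequence $\{P_{k_{n_m}}\}_{m \ge 1}$ such that $\mathcal{S}(P_{k_{n_m}}) \cap \mathcal{X}_y \ne \emptyset$, for all $m \in \mathbb{Z}^+$, if necessary.

Combining (58) and (59), we conclude that

\[ \lim_{k \to \infty} \left. \frac{\partial g_y(\rho, P_k)}{\partial \rho} \right|_{\rho = \rho_k} = \left. \frac{\partial g_y(\rho, P_0)}{\partial \rho} \right|_{\rho = \rho_0}, \]

which implies the continuity if $\mathcal{S}(P_0) \cap \mathcal{X}_y = \emptyset$. ∎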
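For any $y \in \mathcal{Y}$, define

\[ \omega(y) := \max\{ |\log w_{\min}(y)|, |\log w_{\max}(y)| \} \in \mathbb{R}_+, \tag{60} \]

where $w_{\min}(y)$ and $w_{\max}(y)$ are as defined in (52) and (53), respectively.
From (44), by using (60), we infer that

\[ \frac{\partial^2 f_y(\rho, P)}{\partial \rho^2} \le f_y(\rho, P) \left[ \frac{2 \log w_{\max}(y)}{(1+\rho)^3} + \frac{\omega(y)^2}{(1+\rho)^4} \right], \tag{61} \]

\[ \frac{\partial^2 f_y(\rho, P)}{\partial \rho^2} \ge \frac{2 f_y(\rho, P) \log w_{\min}(y)}{(1+\rho)^3}. \tag{62} \]

Consider any sequence $\{(\rho_k, P_k)\}_{k \ge 1}$ in $\mathbb{R}_+ \times \mathcal{P}(\mathcal{X})$ with $\mathcal{S}(P_k) \cap \mathcal{X}_y \ne \emptyset$ for all $k \in \mathbb{Z}^+$ and $(\rho_k, P_k) \to (\rho_0, P_0)$
for some $(\rho_0, P_0) \in \mathbb{R}_+ \times \mathcal{P}(\mathcal{X})$ with $\mathcal{S}(P_0) \cap \mathcal{X}_y = \emptyset$. Using (61) and (62), we deduce that

\[ \frac{2 \log w_{\min}(y)}{(1+\rho_0)^3} \le \liminf_{k \to \infty} \frac{\left. \frac{\partial^2 f_y(\rho, P_k)}{\partial \rho^2} \right|_{\rho = \rho_k}}{f_y(\rho_k, P_k)} \le \limsup_{k \to \infty} \frac{\left. \frac{\partial^2 f_y(\rho, P_k)}{\partial \rho^2} \right|_{\rho = \rho_k}}{f_y(\rho_k, P_k)} \le \frac{2 \log w_{\max}(y)}{(1+\rho_0)^3} + \frac{\omega(y)^2}{(1+\rho_0)^4}. \tag{63} \]

Note that (63) is evident if $\mathcal{S}(P_0) \cap \mathcal{X}_y \ne \emptyset$.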

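Lemma A.2: Given any $y \in \mathcal{Y}$, $\frac{\partial^2 g_y(\rho, P)}{\partial \rho^2}$ is continuous for all $(\rho, P) \in \mathbb{R}_+ \times \mathcal{P}(\mathcal{X})$.

Proof: Fix any $y \in \mathcal{Y}$. Consider any $(\rho_0, P_0) \in \mathbb{R}_+ \times \mathcal{P}(\mathcal{X})$.
Note that if $\mathcal{S}(P_0) \cap \mathcal{X}_y \ne \emptyset$, then, by using the continuity of $f_y(\cdot, \cdot)$, $\frac{\partial f_y(\rho, P)}{\partial \rho}$, $g_y(\cdot, \cdot)$ and $\frac{\partial^2 f_y(\rho, P)}{\partial \rho^2}$, (50)
implies the continuity of $\frac{\partial^2 g_y(\rho, P)}{\partial \rho^2}$ at the point $(\rho_0, P_0)$. Hence, suppose $\mathcal{S}(P_0) \cap \mathcal{X}_y = \emptyset$.

Let $\{(\rho_k, P_k)\}_{k \ge 1}$ be arbitrary with $\lim_{k \to \infty} (\rho_k, P_k) = (\rho_0, P_0)$. Observe that (48), along with (42) and (47),
ensures that

\[ \left. \frac{\partial^2 g_y(\rho, P_k)}{\partial \rho^2} \right|_{\rho = \rho_k} = 0, \text{ if } \mathcal{S}(P_k) \cap \mathcal{X}_y = \emptyset. \tag{64} \]

Consider any subsequence $\{(\rho_{k_n}, P_{k_n})\}_{n \ge 1}$. Now, if all but a finite number of $P_{k_n}$ satisfy $\mathcal{S}(P_{k_n}) \cap \mathcal{X}_y = \emptyset$, then

\[ \lim_{n \to \infty} \left. \frac{\partial^2 g_y(\rho, P_{k_n})}{\partial \rho^2} \right|_{\rho = \rho_{k_n}} = 0, \tag{65} \]

owing to (64). Suppose this is not the case. We also have⁸

\[ \lim_{n \to \infty} \left. \frac{\partial^2 g_y(\rho, P_{k_n})}{\partial \rho^2} \right|_{\rho = \rho_{k_n}} = 0, \tag{66} \]

by using the continuity of $f_y(\cdot, \cdot)$, $g_y(\cdot, \cdot)$ and $\frac{\partial g_y(\rho, P)}{\partial \rho}$, along with (46), (48), (49), (50), (56) and (63).

Combining (65) and (66), we conclude that

\[ \lim_{k \to \infty} \left. \frac{\partial^2 g_y(\rho, P_k)}{\partial \rho^2} \right|_{\rho = \rho_k} = \left. \frac{\partial^2 g_y(\rho, P_0)}{\partial \rho^2} \right|_{\rho = \rho_0}, \]

which implies the continuity if $\mathcal{S}(P_0) \cap \mathcal{X}_y = \emptyset$. ∎

8 Passing to a further subsequence $\{P_{k_{n_m}}\}_{m \ge 1}$ such that $\mathcal{S}(P_{k_{n_m}}) \cap \mathcal{X}_y \ne \emptyset$, for all $m \in \mathbb{Z}^+$, if necessary.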
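Note that from (45), by using (52), (53) and (60), one can show that

\[ \frac{\partial^3 f_y(\rho, P)}{\partial \rho^3} \le \frac{1}{(1+\rho)^4} \sum_{x \in \mathcal{X}_y} P(x) W(y|x)^{\frac{1}{1+\rho}} \left[ 6 \log \frac{1}{w_{\min}(y)} - \frac{(\log w_{\min}(y))^3}{(1+\rho)^2} \right], \tag{67} \]

\[ \frac{\partial^3 f_y(\rho, P)}{\partial \rho^3} \ge \frac{1}{(1+\rho)^4} \sum_{x \in \mathcal{X}_y} P(x) W(y|x)^{\frac{1}{1+\rho}} \left[ 6 \log \frac{1}{w_{\max}(y)} - \frac{6 \omega(y)^2}{1+\rho} - \frac{(\log w_{\max}(y))^3}{(1+\rho)^2} \right]. \tag{68} \]

Consider any sequence $\{(\rho_k, P_k)\}_{k \ge 1}$ in $\mathbb{R}_+ \times \mathcal{P}(\mathcal{X})$ with $\mathcal{S}(P_k) \cap \mathcal{X}_y \ne \emptyset$ for all $k \in \mathbb{Z}^+$ and $(\rho_k, P_k) \to (\rho_0, P_0)$
for some $(\rho_0, P_0) \in \mathbb{R}_+ \times \mathcal{P}(\mathcal{X})$ with $\mathcal{S}(P_0) \cap \mathcal{X}_y = \emptyset$. Using (67) and (68), we deduce that

\[ \begin{aligned} \frac{1}{(1+\rho_0)^4} \left[ 6 \log \frac{1}{w_{\max}(y)} - \frac{6 \omega(y)^2}{1+\rho_0} - \frac{(\log w_{\max}(y))^3}{(1+\rho_0)^2} \right] &\le \liminf_{k \to \infty} \frac{\left. \frac{\partial^3 f_y(\rho, P_k)}{\partial \rho^3} \right|_{\rho = \rho_k}}{f_y(\rho_k, P_k)} \\ &\le \limsup_{k \to \infty} \frac{\left. \frac{\partial^3 f_y(\rho, P_k)}{\partial \rho^3} \right|_{\rho = \rho_k}}{f_y(\rho_k, P_k)} \\ &\le \frac{1}{(1+\rho_0)^4} \left[ 6 \log \frac{1}{w_{\min}(y)} - \frac{(\log w_{\min}(y))^3}{(1+\rho_0)^2} \right]. \end{aligned} \tag{69} \]

Note that (69) is evident if $\mathcal{S}(P_0) \cap \mathcal{X}_y \ne \emptyset$.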

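Lemma A.3: Given any $y \in \mathcal{Y}$, $\frac{\partial^3 g_y(\rho, P)}{\partial \rho^3}$ is continuous for all $(\rho, P) \in \mathbb{R}_+ \times \mathcal{P}(\mathcal{X})$.

Proof: Fix any $y \in \mathcal{Y}$. Consider any $(\rho_0, P_0) \in \mathbb{R}_+ \times \mathcal{P}(\mathcal{X})$.
Note that if $\mathcal{S}(P_0) \cap \mathcal{X}_y \ne \emptyset$, then, by using the continuity of $f_y(\cdot, \cdot)$ and its first three partial derivatives with respect to $\rho$, $g_y(\cdot, \cdot)$, $\frac{\partial g_y(\rho, P)}{\partial \rho}$
and $\frac{\partial^2 g_y(\rho, P)}{\partial \rho^2}$, (51) implies the continuity of $\frac{\partial^3 g_y(\rho, P)}{\partial \rho^3}$ at the point $(\rho_0, P_0)$. Hence, suppose $\mathcal{S}(P_0) \cap \mathcal{X}_y = \emptyset$.

Let $\{(\rho_k, P_k)\}_{k \ge 1}$ be arbitrary with $\lim_{k \to \infty} (\rho_k, P_k) = (\rho_0, P_0)$. Observe that (48), along with (42) and (47),
ensures that

\[ \left. \frac{\partial^3 g_y(\rho, P_k)}{\partial \rho^3} \right|_{\rho = \rho_k} = 0, \text{ if } \mathcal{S}(P_k) \cap \mathcal{X}_y = \emptyset. \tag{70} \]

Consider any subsequence $\{(\rho_{k_n}, P_{k_n})\}_{n \ge 1}$. Now, if all but a finite number of $P_{k_n}$ satisfy $\mathcal{S}(P_{k_n}) \cap \mathcal{X}_y = \emptyset$, then

\[ \lim_{n \to \infty} \left. \frac{\partial^3 g_y(\rho, P_{k_n})}{\partial \rho^3} \right|_{\rho = \rho_{k_n}} = 0, \tag{71} \]

owing to (70). Suppose this is not the case. Further, we have (passing to a further subsequence $\{P_{k_{n_m}}\}_{m \ge 1}$ such
that $\mathcal{S}(P_{k_{n_m}}) \cap \mathcal{X}_y \ne \emptyset$, for all $m \in \mathbb{Z}^+$, if necessary)

\[ \lim_{n \to \infty} \left. \frac{\partial^3 g_y(\rho, P_{k_n})}{\partial \rho^3} \right|_{\rho = \rho_{k_n}} = 0, \tag{72} \]

by using the continuity of $f_y(\cdot, \cdot)$, $g_y(\cdot, \cdot)$, $\frac{\partial g_y(\rho, P)}{\partial \rho}$ and $\frac{\partial^2 g_y(\rho, P)}{\partial \rho^2}$, along with (46), (48), (49), (50), (51), (56), (63)
and (69).

Combining (71) and (72), we conclude that

\[ \lim_{k \to \infty} \left. \frac{\partial^3 g_y(\rho, P_k)}{\partial \rho^3} \right|_{\rho = \rho_k} = \left. \frac{\partial^3 g_y(\rho, P_0)}{\partial \rho^3} \right|_{\rho = \rho_0}, \]

which implies the continuity if $\mathcal{S}(P_0) \cap \mathcal{X}_y = \emptyset$. ∎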

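Lastly, recalling the definition of $E_0(\rho, P)$ and (47), it is easy to see that

\[ E_0(\rho, P) = -\log \sum_{y \in \mathcal{Y}} g_y(\rho, P). \tag{73} \]

Using (73), one can check that

\[ \frac{\partial E_0(\rho, P)}{\partial \rho} = -\frac{\sum_{y \in \mathcal{Y}} \frac{\partial g_y(\rho, P)}{\partial \rho}}{\sum_{y \in \mathcal{Y}} g_y(\rho, P)}, \tag{74} \]

\[ \frac{\partial^2 E_0(\rho, P)}{\partial \rho^2} = -\frac{\sum_{y \in \mathcal{Y}} \frac{\partial^2 g_y(\rho, P)}{\partial \rho^2}}{\sum_{y \in \mathcal{Y}} g_y(\rho, P)} + \left( \frac{\partial E_0(\rho, P)}{\partial \rho} \right)^2, \tag{75} \]

\[ \frac{\partial^3 E_0(\rho, P)}{\partial \rho^3} = -\frac{\sum_{y \in \mathcal{Y}} \frac{\partial^3 g_y(\rho, P)}{\partial \rho^3}}{\sum_{y \in \mathcal{Y}} g_y(\rho, P)} + 3 \frac{\partial E_0(\rho, P)}{\partial \rho} \frac{\partial^2 E_0(\rho, P)}{\partial \rho^2} - \left( \frac{\partial E_0(\rho, P)}{\partial \rho} \right)^3. \tag{76} \]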
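The identities (74) to (76) follow from repeated application of the chain rule. The derivation below is a sketch in which $G$ is shorthand introduced here (it is not notation from the paper) for the sum of the $g_y$, and primes denote partial derivatives with respect to $\rho$.

```latex
G := \sum_{y \in \mathcal{Y}} g_y(\rho, P), \qquad E_0 = -\log G,
\\[4pt]
E_0' = -\frac{G'}{G},
\\[4pt]
E_0'' = -\frac{G''}{G} + \left(\frac{G'}{G}\right)^2 = -\frac{G''}{G} + (E_0')^2,
\\[4pt]
E_0''' = -\frac{G'''}{G} + \frac{G'' G'}{G^2} + 2 E_0' E_0''
       = -\frac{G'''}{G} + 3 E_0' E_0'' - (E_0')^3 ,
```

where the last step substitutes $G'/G = -E_0'$ and $G''/G = (E_0')^2 - E_0''$ into the middle term.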

The assertions of the lemma now follow:

1) For any given $P \in \mathcal{P}(\mathcal{X})$, the concavity of $E_o(\cdot, P)$ on $\mathbb{R}_+$ can be proven either by checking the non-positivity of $\partial^2 E_o(\rho, P)/\partial \rho^2$ given in (75), or by directly applying Hölder's inequality (e.g., [22, Appendix 5B]).

2) By evaluating (42), (43), (47) and (49) at $\rho = 0$ and then plugging the result into (74), one can easily check the validity of the claim.

3) By evaluating (42), (43), (44), (47), (49) and (50) at $\rho = 0$ and plugging the result into (75), one can check the validity of the claim after some algebra.

4) Fix any $P \in \mathcal{P}(\mathcal{X})$. The concavity of $E_o(\cdot, P)$ on $\mathbb{R}_+$ (recall item 1) above) ensures that $\partial^2 E_o(\rho, P)/\partial \rho^2 \leq 0$, for all $\rho \in \mathbb{R}_+$. This, coupled with item 2) above, implies the claim.

5) The continuity of $g_y(\cdot, \cdot)$ on $\mathcal{P}(\mathcal{X}) \times \mathbb{R}_+$ and Lemma A.1, along with (74), imply the claim.

6) The continuity of $g_y(\cdot, \cdot)$ on $\mathcal{P}(\mathcal{X}) \times \mathbb{R}_+$, Lemma A.2 and item 5) above, along with (75), imply the claim.

7) The continuity of $g_y(\cdot, \cdot)$ on $\mathcal{P}(\mathcal{X}) \times \mathbb{R}_+$, Lemma A.3 and items 5) and 6) above, along with (76), imply the claim.
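Items 1) and 2) can be checked numerically on a small example. The sketch below evaluates $E_o$ through (73), assuming Gallager's form $g_y(\rho, P) = \big(\sum_x P(x) W(y|x)^{1/(1+\rho)}\big)^{1+\rho}$; this form is an assumption here, since (47) is not restated in this appendix. The channel matrix `W`, the input distribution `P` and all function names are illustrative only.

```python
import math

def e_o(rho, P, W):
    """E_o(rho, P) = -log sum_y g_y(rho, P), with Gallager's g_y (assumed form)."""
    total = 0.0
    for y in range(len(W[0])):
        inner = sum(P[x] * W[x][y] ** (1.0 / (1.0 + rho))
                    for x in range(len(P)) if P[x] > 0 and W[x][y] > 0)
        total += inner ** (1.0 + rho)
    return -math.log(total)

def mutual_information(P, W):
    """Mutual information I(P; W) in nats."""
    q = [sum(P[x] * W[x][y] for x in range(len(P))) for y in range(len(W[0]))]
    return sum(P[x] * W[x][y] * math.log(W[x][y] / q[y])
               for x in range(len(P)) for y in range(len(q))
               if P[x] > 0 and W[x][y] > 0)

# Hypothetical binary-input, ternary-output channel (rows are inputs).
W = [[0.7, 0.2, 0.1],
     [0.1, 0.3, 0.6]]
P = [0.4, 0.6]

# Item 2): the slope of E_o at rho = 0 should be close to I(P; W).
h = 1e-5
slope_at_zero = (e_o(h, P, W) - e_o(0.0, P, W)) / h

# Item 1): second differences of a concave function are non-positive.
second_diffs = [e_o(r + 0.1, P, W) - 2 * e_o(r, P, W) + e_o(r - 0.1, P, W)
                for r in (0.1, 0.5, 1.0, 2.0)]

print(slope_at_zero, mutual_information(P, W), second_diffs)
```

A forward difference is used at $\rho = 0$ because $E_o$ is only considered on $\mathbb{R}_+$; its error is of order `h`.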



Appendix B 
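Proof of Lemma 3.1

The proof follows steps similar to those of [22, Theorem 5.8.5]. Let $(f, \varphi)$, $V \in \mathcal{P}(\mathcal{Y}|\mathcal{X})$ and $\delta > 0$ be as in the statement of the lemma. Define

\[ G_n(m) := \left\{ y^n \in \mathcal{Y}^n : \frac{1}{n} \log \frac{V^n(y^n | x^n(m))}{Q^n(y^n)} > I(P; V) + \delta \right\}, \tag{77} \]

for any $m \in \mathcal{M} := \{1, \ldots, \lceil e^{nR} \rceil\}$, where $Q^n(y^n) := \prod_{k=1}^{n} Q(y_k)$, $\forall y^n \in \mathcal{Y}^n$, along with $Q(y) := \sum_{x \in \mathcal{X}} P(x) V(y|x)$.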

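Also, for the sake of notational convenience, define $i(x, y) := \log \frac{V(y|x)}{Q(y)}$, for any $(x, y) \in \mathcal{X} \times \mathcal{Y}$. Note that we have

\[ \log \frac{V^n(y^n | x^n(m))}{Q^n(y^n)} = \sum_{k=1}^{n} \log \frac{V(y_k | x_k(m))}{Q(y_k)} = \sum_{k=1}^{n} i(x_k(m), y_k), \tag{78} \]

for all $m \in \mathcal{M}$, where $x^n(m)$ denotes the codeword of the code corresponding to the message $m$. Writing $i(x^n(m), y^n)$ for the left side of (78), for any $m \in \mathcal{M}$ we have

\begin{align}
\mathrm{E}_{V^n}\left[ i(x^n(m), Y^n) \mid x^n(m) \right] &= \sum_{k=1}^{n} \mathrm{E}_{V}\left[ i(x_k(m), Y_k) \mid x_k(m) \right] \tag{79} \\
&= \sum_{x \in \mathcal{X}} N(x | x^n(m)) \sum_{y \in \mathcal{Y}} V(y|x) \log \frac{V(y|x)}{Q(y)} \tag{80} \\
&= n \sum_{x \in \mathcal{X}} P(x) \sum_{y \in \mathcal{Y}} V(y|x) \log \frac{V(y|x)}{Q(y)} \tag{81} \\
&= n I(P; V), \tag{82}
\end{align}

where (79) follows from (78), (80) follows from the definition of $N(x | x^n)$, which denotes the number of occurrences of the symbol $x \in \mathcal{X}$ in the string $x^n$, and (81) follows from the definition of the type $P$.

Next, let $\varphi^{-1}(m) \subset \mathcal{Y}^n$ denote the decoding region of $(f, \varphi)$ for message $m$, $\forall m \in \mathcal{M}$. We have

\begin{align}
1 - e(V, f, \varphi) &= \sum_{m \in \mathcal{M}} \frac{1}{|\mathcal{M}|} \sum_{y^n \in \varphi^{-1}(m)} V^n(y^n | x^n(m)) \nonumber \\
&= \sum_{m \in \mathcal{M}} \frac{1}{|\mathcal{M}|} \sum_{y^n \in \varphi^{-1}(m) \cap G_n(m)^c} V^n(y^n | x^n(m)) + \sum_{m \in \mathcal{M}} \frac{1}{|\mathcal{M}|} \sum_{y^n \in \varphi^{-1}(m) \cap G_n(m)} V^n(y^n | x^n(m)). \tag{83}
\end{align}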
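The type-counting argument behind (79)-(82) can be illustrated with a short Python sketch. It sums the per-letter conditional expectations of $i(x_k, Y_k)$ along a codeword and compares the result with $n$ times the type-weighted average, which is $n I(P; V)$. The channel `V`, the codeword and the function name are hypothetical choices made for illustration.

```python
import math
from collections import Counter

def expected_density_sum(codeword, V):
    """Return (sum over positions, n * type-weighted sum) of E_V[i(x, Y) | x]."""
    n = len(codeword)
    counts = Counter(codeword)                    # N(x | x^n)
    P = {x: c / n for x, c in counts.items()}     # type of the codeword
    num_outputs = len(next(iter(V.values())))
    # Output distribution induced by the type P and the channel V.
    Q = [sum(P[x] * V[x][y] for x in P) for y in range(num_outputs)]
    # E_V[i(x, Y) | x] for each input symbol that occurs in the codeword.
    per_letter = {x: sum(V[x][y] * math.log(V[x][y] / Q[y])
                         for y in range(num_outputs) if V[x][y] > 0)
                  for x in P}
    by_position = sum(per_letter[x] for x in codeword)      # right side of (79)
    by_type = n * sum(P[x] * per_letter[x] for x in P)      # right side of (81)
    return by_position, by_type

# Hypothetical binary channel and a codeword of type (1/2, 1/2).
V = {0: [0.9, 0.1], 1: [0.2, 0.8]}
codeword = [0, 1, 1, 0, 1, 0, 0, 1]
by_position, by_type = expected_density_sum(codeword, V)
print(by_position, by_type)
```

The two values agree up to floating-point rounding because the sum over positions only depends on how often each symbol occurs, which is exactly what (80) expresses.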

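Recalling (77), for any $y^n \in G_n(m)^c$, we have

\[ V^n(y^n | x^n(m)) \leq Q^n(y^n) \exp\{ n [I(P; V) + \delta] \}, \]

which, in turn, implies that

\begin{align}
\sum_{m \in \mathcal{M}} \frac{1}{|\mathcal{M}|} \sum_{y^n \in \varphi^{-1}(m) \cap G_n(m)^c} V^n(y^n | x^n(m)) &\leq \sum_{m \in \mathcal{M}} \frac{1}{|\mathcal{M}|} \sum_{y^n \in \varphi^{-1}(m) \cap G_n(m)^c} Q^n(y^n) e^{n [I(P; V) + \delta]} \nonumber \\
&\leq \sum_{m \in \mathcal{M}} \frac{1}{|\mathcal{M}|} \sum_{y^n \in \varphi^{-1}(m)} Q^n(y^n) e^{n [I(P; V) + \delta]} \nonumber \\
&= \frac{\exp\{ n [I(P; V) + \delta] \}}{|\mathcal{M}|} \sum_{m \in \mathcal{M}} \sum_{y^n \in \varphi^{-1}(m)} Q^n(y^n) \nonumber \\
&\leq \exp\{ -n [R - I(P; V) - \delta] \} \tag{84} \\
&\leq e^{-n\delta}, \tag{85}
\end{align}

where (84) follows from the fact that the decoding regions are disjoint and $Q^n$ is a probability measure on $\mathcal{Y}^n$, and (85) follows from the assumption $I(P; V) \leq R - 2\delta$.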
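Next, note that for any $m \in \mathcal{M}$,

\begin{align}
\sum_{y^n \in \varphi^{-1}(m) \cap G_n(m)} V^n(y^n | x^n(m)) &\leq \sum_{y^n \in G_n(m)} V^n(y^n | x^n(m)) \nonumber \\
&= V^n\{ G_n(m) \mid x^n(m) \}. \tag{86}
\end{align}

Further, using Chebyshev's inequality (recall (77), (78) and (82)), for any $m \in \mathcal{M}$ we have

\begin{align}
V^n\{ G_n(m) \mid x^n(m) \} &\leq \frac{\sum_{k=1}^{n} \mathrm{Var}\left[ i(x_k(m); Y_k) \mid x_k(m) \right]}{n^2 \delta^2} \nonumber \\
&= \frac{1}{n \delta^2} \left\{ \frac{1}{n} \sum_{k=1}^{n} \sum_{y \in \mathcal{Y}} V(y | x_k(m)) \log^2 \frac{V(y | x_k(m))}{Q(y)} - \frac{1}{n} \sum_{k=1}^{n} \left( \sum_{y \in \mathcal{Y}} V(y | x_k(m)) \log \frac{V(y | x_k(m))}{Q(y)} \right)^2 \right\} \nonumber \\
&\leq \frac{1}{n \delta^2} \left\{ \frac{1}{n} \sum_{k=1}^{n} \sum_{y \in \mathcal{Y}} V(y | x_k(m)) \log^2 \frac{V(y | x_k(m))}{Q(y)} - \left( \frac{1}{n} \sum_{k=1}^{n} \sum_{y \in \mathcal{Y}} V(y | x_k(m)) \log \frac{V(y | x_k(m))}{Q(y)} \right)^2 \right\} \tag{87} \\
&= \frac{1}{n \delta^2} \left\{ \sum_{x \in \mathcal{X}} P(x) \sum_{y \in \mathcal{Y}} V(y|x) \log^2 \frac{V(y|x)}{Q(y)} - \left( \sum_{x \in \mathcal{X}} P(x) \sum_{y \in \mathcal{Y}} V(y|x) \log \frac{V(y|x)}{Q(y)} \right)^2 \right\} \tag{88} \\
&= \frac{1}{n \delta^2} \mathrm{Var}\left[ \log \frac{V(Y|X)}{Q(Y)} \right], \tag{89}
\end{align}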

where (87) follows from Jensen's inequality and (88) follows from the definition of $P$. Plugging (89) into (86) and recalling (27) yields

\[ \forall m \in \mathcal{M}, \quad \sum_{y^n \in \varphi^{-1}(m) \cap G_n(m)} V^n(y^n | x^n(m)) \leq \frac{A}{n \delta^2}. \tag{90} \]

Plugging (85) and (90) into (83), we deduce that

\[ e(V, f, \varphi) \geq 1 - \frac{A}{n \delta^2} - e^{-n\delta}, \]

that is (28), which was to be shown.
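To get a feel for how loose the Chebyshev step (86)-(89) is, the following Python sketch compares the bound $\mathrm{Var}[\log(V(Y|X)/Q(Y))]/(n\delta^2)$ with the exact value of $V^n\{G_n(m) \mid x^n(m)\}$ in one special case. It assumes that $V$ is a binary symmetric channel with crossover probability `p` and that $P$ is uniform, so that $Q$ is uniform and the information density depends only on the number of flipped symbols. These choices, the parameter values and the function name are illustrative assumptions, not part of the lemma; logarithms are natural.

```python
import math

def chebyshev_vs_exact(n, p, delta):
    """Exact V^n{G_n(m) | x^n(m)} and its Chebyshev bound for a BSC(p), uniform P."""
    log_odds = math.log((1 - p) / p)
    # I(P; V) = log 2 - H(p) for the BSC with uniform input.
    mutual_info = math.log(2) + p * math.log(p) + (1 - p) * math.log(1 - p)
    # Var[log V(Y|X)/Q(Y)]: the density takes two values, with probabilities p and 1 - p.
    variance = p * (1 - p) * log_odds ** 2
    bound = variance / (n * delta ** 2)           # right side of (89)
    exact = 0.0
    for flips in range(n + 1):
        # Sum of i(x_k, y_k) over the block when exactly `flips` symbols are flipped.
        density_sum = (n * math.log(2) + (n - flips) * math.log(1 - p)
                       + flips * math.log(p))
        if density_sum / n > mutual_info + delta:  # y^n lies in G_n(m), cf. (77)
            exact += math.comb(n, flips) * p ** flips * (1 - p) ** (n - flips)
    return exact, bound

exact, bound = chebyshev_vs_exact(n=200, p=0.1, delta=0.2)
print(exact, bound)
```

In this example the exact probability is several orders of magnitude below the bound, which is consistent with Chebyshev's inequality being a second-moment estimate only.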



Appendix C 
Proof of Proposition 3.1

Let $P_n$ and $\rho_n$ achieve the maxima in (36) at rate $C - \delta_n$, i.e.,

\[ E_{SP}(C - \delta_n, W) = -\rho_n (C - \delta_n) + E_o(\rho_n, P_n). \]

Now $E_{SP}(C - \delta_n, W) > 0$ for all $n$, which implies that $\rho_n > 0$ for all $n$. Since $E_o(\rho, P)$ is concave in $\rho$, it follows that

\[ \left. \frac{\partial E_o(\rho, P_n)}{\partial \rho} \right|_{\rho = \rho_n} = C - \delta_n \tag{91} \]

for all $n$.

Our proof of Proposition 3.1 will use the following lemma.
Lemma C.1: (a) Any limit point of $\{P_n\}$ is capacity achieving.

(b) $\lim_{n \to \infty} \rho_n = 0$.

(c) $\limsup_{n \to \infty} \frac{\rho_n}{\delta_n} \leq \frac{1}{\sigma^2(W)}$.

Proof: Consider arbitrary subsequences $\{P_{n_k}\}_{k \geq 1}$ and $\{\rho_{n_k}\}_{k \geq 1}$ and note that, owing to the compactness of $\mathcal{P}(\mathcal{X})$ and $[0, 1]$ (switching to a further subsequence, if necessary), we may assume that

\[ \lim_{k \to \infty} P_{n_k} = P_0, \qquad \lim_{k \to \infty} \rho_{n_k} = \rho_0, \tag{92} \]

for some $P_0 \in \mathcal{P}(\mathcal{X})$ and $\rho_0 \in [0, 1]$.

Now (91) and part 5) of Lemma 2.1 together imply that

\[ C = \left. \frac{\partial E_o(\rho, P_0)}{\partial \rho} \right|_{\rho = \rho_0}. \]

On the other hand, part 4) of Lemma 2.1 implies that

\[ \left. \frac{\partial E_o(\rho, P_0)}{\partial \rho} \right|_{\rho = \rho_0} \leq I(P_0; W) \leq C. \tag{93} \]

It follows that $P_0$ is capacity achieving. Since the subsequence was arbitrary, this establishes (a).

Since $P_0$ is capacity achieving, the assumption that $\sigma^2(W) > 0$ implies that $\left. \frac{\partial^2 E_o(\rho, P_0)}{\partial \rho^2} \right|_{\rho = 0} < 0$ by part 3) of Lemma 2.1. Then items 1) and 2) of Lemma 2.1 imply that the first inequality in (93) holds with equality if and only if $\rho_0 = 0$. Since the subsequence was arbitrary, this establishes (b).

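Next consider $\frac{\partial E_o(\rho, P_{n_k})}{\partial \rho}$, viewed as a function of $\rho$. This function equals $I(P_{n_k}; W)$ at $\rho = 0$ by part 2) of Lemma 2.1 and it equals $C - \delta_{n_k}$ at $\rho_{n_k}$ by (91). It is differentiable in $\rho$ by part 6) of Lemma 2.1. Thus by the mean value theorem, there must exist a $\bar{\rho}_{n_k}$ in $[0, \rho_{n_k}]$ such that

\[ -\left. \frac{\partial^2 E_o(\rho, P_{n_k})}{\partial \rho^2} \right|_{\rho = \bar{\rho}_{n_k}} = \frac{I(P_{n_k}; W) - C + \delta_{n_k}}{\rho_{n_k}} \leq \frac{\delta_{n_k}}{\rho_{n_k}}. \]

Now by parts 3) and 6) of Lemma 2.1,

\[ \lim_{k \to \infty} \left. \frac{\partial^2 E_o(\rho, P_{n_k})}{\partial \rho^2} \right|_{\rho = \bar{\rho}_{n_k}} = \left. \frac{\partial^2 E_o(\rho, P_0)}{\partial \rho^2} \right|_{\rho = 0} = -\sigma^2(P_0, W) \leq -\sigma^2(W). \]

Combining the last two inequalities gives

\[ \limsup_{k \to \infty} \frac{\rho_{n_k}}{\delta_{n_k}} \leq \frac{1}{\sigma^2(W)}. \tag{94} \]

Since the subsequence was arbitrary, this establishes (c). ■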
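We are now in a position to prove Proposition 3.1. For any sufficiently large $n$, Taylor's Theorem gives (recalling items 2) and 3) of Lemma 2.1)

\begin{align}
E_{SP}(C - \delta_n, W) &= -\rho_n [C - \delta_n] + E_o(\rho_n, P_n) \nonumber \\
&= \rho_n [I(P_n; W) - C + \delta_n] - \frac{(\rho_n)^2}{2} \sigma^2(P_n, W) + \frac{(\rho_n)^3}{6} \left. \frac{\partial^3 E_o(\rho, P_n)}{\partial \rho^3} \right|_{\rho = \bar{\rho}_n} \nonumber
\end{align}

for some $\bar{\rho}_n \in [0, \rho_n]$. If we use the constant $M$ defined in (23), then we eventually have

\[ E_{SP}(C - \delta_n, W) \leq \rho_n [I(P_n; W) - C + \delta_n] - \frac{(\rho_n)^2}{2} \sigma^2(P_n, W) + \frac{(\rho_n)^3 M}{6}. \]

Since we must have $I(P_n; W) \leq C$, this yields

\begin{align}
E_{SP}(C - \delta_n, W) &\leq \rho_n \delta_n - \frac{(\rho_n)^2}{2} \sigma^2(P_n, W) + \frac{(\rho_n)^3 M}{6} \nonumber \\
&\leq \sup_{\rho \in \mathbb{R}_+} \left\{ \rho \delta_n - \frac{\rho^2}{2} \sigma^2(P_n, W) \right\} + \frac{(\rho_n)^3 M}{6} \tag{95} \\
&= \frac{\delta_n^2}{2 \sigma^2(P_n, W)} + \frac{(\rho_n)^3 M}{6}. \tag{96}
\end{align}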
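The step from (95) to (96) uses that a concave quadratic $\rho \delta - \rho^2 \sigma^2 / 2$ is maximized over $\rho \geq 0$ at $\rho = \delta / \sigma^2$, with maximum $\delta^2 / (2 \sigma^2)$. A minimal Python sketch confirming this on a grid, with purely illustrative values for $\delta$ and $\sigma^2$ and a hypothetical helper name:

```python
def quadratic_sup(delta, sigma2, grid=200001, rho_max=10.0):
    """Grid approximation of sup over rho in [0, rho_max] of rho*delta - rho^2*sigma2/2."""
    best = 0.0  # value at rho = 0
    for j in range(grid):
        rho = rho_max * j / (grid - 1)
        best = max(best, rho * delta - 0.5 * rho * rho * sigma2)
    return best

delta, sigma2 = 0.05, 0.4                      # illustrative values only
approx = quadratic_sup(delta, sigma2)
closed_form = delta ** 2 / (2 * sigma2)        # first term on the right side of (96)
print(approx, closed_form)
```

The grid search is restricted to a bounded interval, which is harmless here because the maximizer $\delta / \sigma^2$ lies well inside it.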
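Using (96) and parts (b) and (c) of Lemma C.1, we deduce that

\begin{align}
\limsup_{n \to \infty} \frac{E_{SP}(C - \delta_n, W)}{\delta_n^2} &\leq \limsup_{n \to \infty} \frac{1}{2 \sigma^2(P_n, W)} \nonumber \\
&\leq \frac{1}{2 \sigma^2(W)}, \tag{97}
\end{align}

where (97) follows from the continuity of $\sigma^2(\cdot, W)$ on $\mathcal{P}(\mathcal{X})$ (parts 3) and 6) of Lemma 2.1), Lemma C.1(a) and the definition of $\sigma^2(W)$.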
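The behavior described by (97) can be observed numerically for a binary symmetric channel. The sketch below rests on several assumptions that are not taken from this appendix: Gallager's form of $E_o$, the uniform input distribution (optimal for the BSC by symmetry), a maximization over $\rho \in [0, 1]$ only, natural logarithms, and the known BSC dispersion $p(1-p)\log^2\frac{1-p}{p}$ in the role of $\sigma^2(W)$. The function names are hypothetical.

```python
import math

def e_o_bsc(rho, p):
    """Gallager's E_o(rho, P) for a BSC(p) with uniform P (assumed form, in nats)."""
    s = 1.0 / (1.0 + rho)
    return rho * math.log(2) - (1.0 + rho) * math.log(p ** s + (1 - p) ** s)

def e_sp_bsc(rate, p, rho_max=1.0, iters=200):
    """Maximize -rho*rate + E_o(rho) over [0, rho_max] by ternary search.

    Ternary search is valid because the objective is concave in rho.
    """
    lo, hi = 0.0, rho_max
    for _ in range(iters):
        a = lo + (hi - lo) / 3
        b = hi - (hi - lo) / 3
        if -a * rate + e_o_bsc(a, p) < -b * rate + e_o_bsc(b, p):
            lo = a
        else:
            hi = b
    rho = 0.5 * (lo + hi)
    return -rho * rate + e_o_bsc(rho, p)

p = 0.1
capacity = math.log(2) + p * math.log(p) + (1 - p) * math.log(1 - p)
sigma2 = p * (1 - p) * math.log((1 - p) / p) ** 2   # BSC dispersion

for delta in (0.05, 0.01, 0.002):
    ratio = e_sp_bsc(capacity - delta, p) / delta ** 2
    print(delta, ratio, 1 / (2 * sigma2))
```

As the gap to capacity shrinks, the printed ratio should approach $1/(2\sigma^2(W))$, in line with (97); for larger gaps the cubic Taylor term in (96) is still visible.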